library(ggplot2)
In homework 2, you were investigating the de novo mutation rate of R. Waterstonii. In a given generation, you observed the following mutation counts in 12 different isolates:
mutations = c(12,15,10,4,6,15,18,12,7,11,5,14)
mutation_data = data.frame(mutations)
mutation_data
Obtain 1,000 bootstrap samples from your data and recompute λ* for these bootstrap examples.
I will use the boot package to randomly resample this data 1000 times. Each time, I will compute lambda as the mean of the sample. The resulting distribtuion is:
library(boot)
set.seed(1)
samplemean = function(x, d) {
return(mean(x[d]))
}
lambda_ests = boot(data = mutation_data$mutations, statistic = samplemean, R = 1000)
lambda_ests.vals = data.frame(t = lambda_ests$t[1:1000])
print(mean(lambda_ests.vals$t))
## [1] 10.69767
ggplot(lambda_ests.vals, aes(x=t)) + geom_histogram() +
labs(title="Lambda estimate", subtitle='1000 bootstrap samplings', y="count", x = "lambda")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
**Compute the variance of λ*.**
var(lambda_ests$t[1:1000])
## [1] 1.493964
Generate a 95% confidence interval for λ* using the normal approximation method. The lambda estimates should be approximatley normally distributed, because they are a stastic calculated over many random samples.
The edges of the 95% confidence interval of the normal approximation of this distribution are:
m = mean(lambda_ests$t[1:1000])
v = var(lambda_ests$t[1:1000])
qnorm(0.025, mean = m, sd = sqrt(v))
## [1] 8.302046
qnorm((1-0.025), mean = m, sd = sqrt(v))
## [1] 13.09329
Generate a 95% confidence interval for λ* using the percentile method.
I want to find the values that lie, within our statistic distribution, at the edge of the 95% confidence interval.
This is simply the 25th (0.025%) and 975th (97.5%) term in the ordered list of the 1000 lambda estimates.
lambda_ests.sorted = sort(lambda_ests$t[1:1000])
lambda_ests.sorted[25]
## [1] 8.333333
lambda_ests.sorted[(1000-25)]
## [1] 13.08333
Compare the results from c and d. Which do you prefer, and why?
In this case, I prefer the percentile approach. While the values are similar between the normal approximation and percentile approach, since the percential approach was computationally trivial, I prefer to use values that are present in the bootstrap set to describe the bootstrap set, rather than the inferred values of the normal distribution.
You are studying the effect of a new enzyme on the number of PCR cycles required for a reaction. You split each of your 16 samples into two and perform the reactions with the old and new enzymes. The following table summarizes your data on the number of cycles required for each pair of reactions using the old and new enzymes:
old = c(7,5,7,7,10,12,5,13,11,9,9,8,8,13,6,10)
new = c(6,8,8,2,9,9,7,7,4,7,8,7,7,11,8,6)
enzyme_data = data.frame(old, new)
enzyme_data
Briefly describe (in words) your strategy. Why did you choose this approach?
These data are paired. Each observation is from a single sample tested with two different conditions (old vs new enzyme). I will use a paired t-test to evaluate whether the conditions affect the observations.
What are the null and alterative hypotheses for this test
The null hypothesis for this test is that the mean difference between pairs is 0. The alternative hypothesis is that the mean difference between pairs is not 0.
What assumptions does your approach make? Are they all satisfied? How do you know?
-The data are continuous; this is true, assuming we know our measurment technique allows continous observations.
-Each (paired) observation is independent; this is also true, assuming we know our sampling method is not biased.
-The data are approximately normally distributed; by eye, this is true for the ‘new’ data, and it is unclear whether it is true for the ‘old’ data.
old_plot = ggplot(enzyme_data, aes(x=old)) + geom_bar()
new_plot = ggplot(enzyme_data, aes(x=new)) + geom_bar()
old_plot
new_plot
What is the p-value from your analysis?
paired_enzyme_test = t.test(enzyme_data$old, enzyme_data$new, paired = TRUE)
paired_enzyme_test$p.value
## [1] 0.03890274
What do you conclude (be precise about the conclusion)?
Because the p value is less than 0.05, I conclude that the mean of differences between the pairs is not 0. Because our experimental design only changed the enzyme in the reaction, we can infer causality. I infer that changing the enzyme produced a difference in the number of cycles required for each pair of reactions. Because this was a two-sided test, I cannot conclude the direction or magnitude of the difference, only that a difference is present.
Briefly describe (in words) your strategy. Why did you choose this approach? These data are paired. Each observation is from a single sample tested with two different conditions (old vs new enzyme). In the previous approach, paired t-test, I found that the data are not clearly normally distributed. Now, I will use a non-parametric test, the Wilcoxon Signed Rank Test, which does not require normally distributed data to be valid.
What are the null and alterative hypotheses for this test The null hypothesis is that the populations the two samples are drawn from have the same median. The alternative hypothesis is that they are drawn from populations with different medians.
What assumptions does your approach make? Are they all satisfied? How do you know?
-The data are continuous; this is true, assuming we know our measurment technique allows continous observations.
-Each (paired) observation is independent; this is also true, assuming we know our sampling method is not biased.
-The data have a symmmetric distribution; this is true for both ‘old’ and ‘new’ data sets.
What is the p-value from your analysis?
paired_enzyme_wilcoxtest = wilcox.test(enzyme_data$old, enzyme_data$new, paired = TRUE)
## Warning in wilcox.test.default(enzyme_data$old, enzyme_data$new, paired =
## TRUE): cannot compute exact p-value with ties
paired_enzyme_wilcoxtest$p.value
## [1] 0.06432261
What do you conclude (be precise about the conclusion)? The p-value given for the non-parametric test is 0.06, which is above the common threshold of 0.05. This test is less powerful than the t-test. However, this tests assumptions were met, while it is not clear that the t-test’s assumptions were met.
Given this value, I would think it is likely that the medians of the populations the old and new data are drawn from are different; because we know the experimental design, we can say that the median number of cycles needed to complete the reaction with the new enzyme is likely different from the median number needed to complete the reaction with the old enzyme. However, I would not be very confident in this conclusion.
After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e. the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people. Why is it good news that the disease is rare? What are the chances that you actually have the disease?
Given that you tested positive, the chance that you actually have the disease depends on the accuracy of the test (a more specific test increases the chance that you have the disease) and the chance that you have the disease in the absence of testing (the background rate of the disease). Since the background rate is very low, even though you tested positive on the test, it is still overwhelmingly likely that you DO NOT have the disease:
A = you have the disease
B = you tested positive
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
In the absence of testing, P(A) is = 1/10,000.
Since the test is 99% sensitive, P(B|A) is 0.99.
P(B) is calculated as the probability that you test positive given that you have the disease–P(B|A)–times the probability that you have the disease–P(A)–plus the probability that you test positive given that you don’t have the disease–P(B|not A)–times the probability that you don’t have the disease–P(not A): \[
P(B) = P(B|A) \cdot P(A) + P(B|not A) \cdot P(not A)
\]
The probability that you don’t have the disease–P(not A)–is 1-(1/10,000), or 9,999/10,000. The probability that you test positive for the test given that you don’t have the disease–P(B|not A)–is 1-0.99 (specificity), or 0.01.
Finally, we can calculate P(B):
\[ P(B) = 0.99 \cdot \frac{1}{10,000} + 0.01 \cdot \frac{9,999}{10,000} = 0.010098 \]
Therefore, the chance that you actually have the disease, now that you test positive is: \[ P(A|B) = \frac{0.99 * 0.0001}{0.010098} = 0.0098 \]
This is about 100X higher than baseline, but it is still >99% likely that you DO NOT have the disease.
Prove the conditionalized version of the general product rule:
Prove the conditionalized version of Bayes’ rule: